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Abstract 
Background 

Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA 
methylation regions, and other high-throughput features for association with disease. We propose a new approach, 
called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set 
bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled 
data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping 
gene-set analysis and can be used to determine which are the most reproducible gene sets. 

Results 

Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even 
with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may 
be unstable when subjected to resampling. 

Conclusions 

We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can 
improve biological inference of gene set analysis. 
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Background 

The biology of many organisms is organized naturally as a series of diverse pathways, and the genetic 
landscape of cells is no different - genes also group together in pathways to perform specific functions YU . 
Human health depends on the functionality of these pathways; de-regulation at the pathway level may be 
more important for diseases like cancer than de-regulation of specific genes [2] . The most common statistical 
approach for identifying pathways of interest in a high-throughput experiment is to perform a significance 
analysis gene-by-gene and then summarize the significant hits using gene set or gene pathway analyses. Each 
pathway or gene-set analysis is performed once on the entire data set. However, there is variability in the 
identified gene sets due to both the instability in gene rankings from the original gene ranking analysis and 
from the pathway/set analysis. 

Here we propose a new approach to evaluate the stability of biological inference drawn from an experiment. 
Our approach, called gene set bagging, performs a resampling of the entire discovery algorithm - significance 
analysis and gene set enrichment - to identify the most stable and reproducible enriched gene sets. We 
perform resampling by drawing an equal number of samples with replacement from the full (observed) 
dataset, performing a significance analysis followed by gene set analysis, and then identifying which sets 
are enriched. We can identify which observed gene sets are consistently enriched in resampled data, and 
compute the gene set replication probability (R), a measure of gene set stability based directly on the 
biological quantity of interest, representing the probability that an observed gene set will be enriched in 
future experiments. 

The replication proportion (R) has some important advantages over the traditionally-reported p- value for 
summarizing gene set enrichment. The structure of the gene set testing problem is fundamentally different 
than other multiple hypothesis testing problems - correlations between genes, different gene set sizes, and 
different levels and fraction of differential expression within gene sets makes the hypotheses fundamentally 
not comparable with standard significance testing [3|4] . We therefore instead propose to estimate directly the 
probability that a gene set will replicate, as in this more complicated multiple testing scenario, an estimate 
of the probability of replication may be of more interest than a measure of statistical significance. Lastly, 
given the emphasis for replication in genetics/genomics studies, this replication proportion may be another 
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metric for directing molecular validation of important biological processes involved in human disease. 

We perform our gene set bagging method on two genomics measurements: gene expression and DNA 
mcthylation. Even after adjusting the genomic data for potential batch effects, we demonstrate that some 
significant gene sets fail to replicate well, yet other non-significant sets have high replication rates. The results 
for these different genomic technologies suggest that the signal and noise structure of the specific genomic data 
type contribute greatly to stability of gene sets. We use a simulation study to assess replication across two 
simulated datasets, and evaluate the concordance between replication probability (i?) and the traditionally- 
reported significance metric (p- value) . We show that the replication probability better quantifies the chance 
that a significant gene set will be consistent across studies, and the result of our analyses suggests that: (f ) 
gene set enrichment analyses from a typical high-throughput study may be highly unstable, and (2) gene 
set bagging is a resampling approach for measuring the stability of gene sets and ensuring reproducibility of 
biological conclusions. 

Methods 

Bagging, also known as bootstrap aggregating, is traditionally used for assessing the predictive accuracy 
and stability of prediction models [5]. While bagging procedures have been used for differential expression 
analyses [6] , here we introduce a new bagging procedure for significance analysis of gene sets called gene set 
bagging. This procedure can be useful for both evaluating significance rankings and also for describing the 
most reproducible genes and biological gene sets within genomics experiments in a platform-independent 
fashion. 

For gene set I, the goal is to estimate: 

Ri = Pr(Gene set 1 will be significant in a new study). 

The quantity Ri is useful as a measure of the stability of the significance of an identified gene set. Gene sets 
are frequently used to interpret the biological results of studies, so it is important to know if the biological 
"story" would change if the study was repeated. This is particularly true since gene set analysis is subject 
to errors in annotation, variation due to technological noise, and variation due to biological noise. 

As an example of our general approach, we focus on a cigarette smoking dataset (further explained in the 
following Datasets and Implementation section), which examined gene expression differences associated with 
smoking exposure in 40 smokers and 39 never-smokcrs. Wc define gene expression measurements m^- for 
each of j = 1, ... ,79 samples over i = 1, . . . , M genes/probes (corresponding to gene g{) and a covariate of 
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interest per sample (zj € [currentsmoker, never smoker]). We first want to identify differentially expressed 
genes between the two outcome groups, so we calculate an empirical Bayes t-statistic and resulting p-value 
for each gene [7]. We can call any gene significant if a < 0.05 (or, alternatively, we can assume a to be the 
family- wise error rate as to control for multiple testing), and look for enrichment in L predefined gene sets. 
Each gene set gets a p-value (pi), reflecting the degree of enrichment. The prevalent wilcoxon mean rank 
gene set enrichment test [8] available in the limma Bioconductor package [9^ and traditional hypergeometric 
test were used as enrichment tests for each dataset. 

We can then perform gene set bagging using B = 100 iterations. In each iteration (&), we resample the 
40 smokers and 39 smokers separately with replacement. Each gene or probe gets a p-value via calculating 
a t-statistic in the resampled data, and these statistics are passed to gene set analysis algorithms to produce 
a p-value of enrichment for each gene set (pi), which are stored in the first column of a [#gene sets by B] 
matrix. 

In the next iteration, we resample again and create a different resampled dataset, and get another set of 
p- values that are put in the second column of the storage matrix. We fill in the columns of the matrix if we 
do this 100 times. For each row, which represents a single gene or probe, we count the number of times each 
subsampled p-value {p\) is less than a (here, 0.05), and divide it by the number of iterations (B), resulting 
in an estimate of the replication probability for that gene set (Ri ) . 

Regardless of the application, estimated replication probabilities (R) are between and I, where means 
that the gene set always had a p-value greater than a in every iteration, and 1 means that the category 
always had a p-value less than a in each iteration. For analyses where the gene ranking is stable and 
the gene set calculation is stable, the replication probability will be higher. This estimate of replication 
assesses the stability of the gene sets, and might be a better estimate of biological reproducibility than the 
traditionally reported p- values. Our goal is to identify the more stable set of gene sets, akin to Meinshausen, 



and Biihlmann 2010 10 in selecting a more stable set of covariates in a regression model. 



Datasets and Implementation 

Simulated Data 

We designed two simulation studies to assess different properties of the replication probability based on the 
Affymetrix Human Genome 133 Plus 2.0 gene expression microarray. Basing the simulation on an existing 
array design, with probes annotated to genes that were already mapped to gene ontology categories, allowed 
us to realistically add differential expression signal to specific gene sets. We first selected a random sample of 
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Algorithm 1 Gene set bagging procedure 



1. Estimate a test statistic for each gene Tj 

2. Use the test statistics to calculate a P-value for each gene set, pi, I = 1 . . . ,L, using any standard gene 
set analysis algorithm. 

3. For (b€l,...,B): 

i. Resample individuals within outcome groups 

ii. Estimate a bootstrap test statistic for each gene T* b 

iii. Use the test statistics to calculate a bootstrap p-value 
for each gene set, pf b , I = 1 . . . , L, using any standard gene 
set analysis algorithm. 

4. Estimate the replication probability Ri = 6=1 4p — — for each gene set. 



100 gene sets to use in our simulation, which corresponded to 2288 unique genes. Then, for each simulation, 
we selected first 100 and then 500 genes (generally denoted G) to insert signal into, via the following model: 



rriij = A) + PiZj + eij 

where e^- ~ A^(6, 1) and ~ A^(l,0.5) if gi € G, and my and Zj (defined above) correspond to the 
expression value and binary outcome respectively. 

In the first simulation, we generated 1000 datasets, where each consisted of 100 individuals (50 cases and 
50 controls). For each dataset, we inserted signal into 100 genes and computed the observed p-value (pi) and 
then the replication probabilities (Ri) for each gene set In the second simulation we generated 100 pairs 
of datasets, where each dataset contained 50 individuals (25 cases and 25 controls), and inserted signal into 
500 genes, and computed observed p-values and replications probabilities for each gene set. 



Gene expression: cigarette smoking data 

We tested the gene set bagging method in a differential expression analysis with data obtained from Gene 
Expression Omnibus (GSE17913). This study examined the effect of cigarette smoking on the oral epithelial 
transcriptome by comparing buccal biopsies in 39 never-smokers with 40 active-smokers using the Affymetrix 



Human Genome U133 Plus 2.0 microarray 11 . We processed the raw CEL files using the RMA algorithm 
to perform intra-array normalization and then performed quantile normalization to adjust for bctween-array 
biases [l2] . 



We performed surrogate variable analysis (SVA) to adjust for potential batch effects 13 14 . Briefly, 
this approach identifies the number of right singular vectors that are associated with more variation than 
expected by chance, and then in the subsets of genes driving this variation, constructs a 'surrogate' variable 
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for each subset. These surrogate variables are then included as covariates in our differential expression 
analysis (so that the model becomes: rriij = f3- t z + ftsvSV + ey). 

We identified differentially expressed genes comparing cases and controls while controlling for the surro- 



gate variables using an empirical Bayes approach 15 . To determine statistical significance, resulting p- values 



were converted to q-values to control for the false discovery rate 16 and all transcripts with q-values less 
than 0.05 were considered significant. We performed a full gene ontology analysis, and then ran the gene set 
bagging algorithm. 



DNA methylation: brain tissue 

We believe this approach to be generalizable to most genomics platforms, and first tested this hypothesis using 
DNA methylation data processed on the Illumina HumanMcthylation27 platform using freely available data 
from GEO [17] [GSE15745]. Our analysis utilized DNA methylation data from a recent paper that assessed 
quantitative trait loci using methylation and expression data in four different brain tissues. Previous work has 
identified that DNA methylation signatures can distinguish brain tissues, and might play a role in determining 
and stabilizing normal brain differentiation [18] . We conducted our gene set bagging algorithm on the 
differential DNA methylation analysis between the frontal and temporal cortices. Detailed preprocessing 
information can be found in the supplementary material. 

We performed the full differential methylation analysis comparing 131 front cortex and 126 temporal 
cortex samples, adjusting for plate number, tissue bank site, sex, and age, using the exact same approach as 
the gene expression example (with and without SVA, then empirical Bayes and multiple testing correction 
on each). All probes with q-values less than 0.05 were considered significant. We performed a full gene 
ontology analysis on the gene associated with each probe (from the annotation table), and ran the gene set 
bagging algorithm. 

Interpretation 

The replication probability (R) reflects stability 

The interpretation of the replication probability reflects the underlying stability of each outcome group. 
Suppose we observe a p- value for a gene set that we call significant at significance level a. The replication 
probability estimate is defined as the fraction of times that feature is significant at the same a level in 
resamplings of the original data. If we called all p- values significant at a = 0.05, R = 0.8 means that feature 
had a p-value less than 0.05 in 80% of resampling datasets. If the statistical signal is stable, significant 
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features will have high replication probability estimates, and non-significant features will have low replication 
probability estimates (R ss 0), because the resampled data should be representative of the overall population. 

To better understand how the replication probability therefore addresses stability, suppose we perform a 
gene expression experiment, and further study two gene sets: Set A with p = 0.001 and Set B with p = 0.2. 
We then calculate the replication probability for these two gene sets, and want to interpret the results. 
Consistency between the replication probability and p-value means that the direction of statistical inference 
is identical: high replication probabilities with low p-values are consistent. 

Table 1: Interpretations of sequential replication probabilities (R) for two different experiment features. 



R 


"Feature A (p = 0.001) is significant, 





but very inconsistent" 


0.25 


but very inconsistent" 


0.5 


inconsistent" 


0.75 


somewhat consistent" 


1 


very consistent" 






R 


"Feature B (p = 0.2) is non-significant, 





very consistent" 


0.25 


somewhat consistent" 


0.5 


inconsistent" 


0.75 


very inconsistent" 


1 


very inconsistent" 



To understand the apparently counterintuitive interpretations, we will focus on the row in Table [l] where 
R = 1.0. Feature A is extremely stable because the significance of this feature seems constant throughout 
resampling. Feature B is also extremely stable because the resampling inferences are relatively non-variable, 
but the replication probability is very inconsistent with the p-value. Interpreting stability and consistency 
through the replication probability therefore requires observed p-values for gene sets. However, it is important 
to remember that the replication probability is a function of the statistical significance level (a); algebra can 
demonstrate that R increases as a increases. 

R estimates the probability a gene set will be significant in a repeated study 

We simulated 1,000 identical data sets (as described in the Datasets section; Simulation 1), these data 
sets represent repeated experiments performed under the same conditions. We spiked in specific genes as 
differentially expressed in these simulated data sets. We then performed gene set analysis using both the 
hypergeometric and Wilcoxon tests and calculated the replication probability estimates for each gene set in 
each of the 1,000 simulated studies. The average replication probability estimate across all 1,000 repeated 
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studies very closely approximates the frequency that a gene set is observed to be significant in those 1,000 
studies (Figures [TJ\ and |Tj3) . In other words, the estimate of the replication probability is close to the 
probability a gene set will be significant in a future study. 



Replication adds biological interpretability 

In the gene expression dataset (Figure [2]), there were 8 GO categories with p > 0.05 and R > 0.8 under the 
hypergeometric test, including sets associated with phosphorylation (GO:0006468, GO:0016310), a process 



affected by cigarette smoking 19 and regulation of metabolic processes (GO:0019222, GO:0044267). Sim- 
ilarly, examining the categories associated with DNA methylation differences across brain tissue that had 
at least moderate replication and non-significant p-values demonstrates support for the gene set bagging 
approach as well as the shortcomings of relying on strict p-values cutoffs for gene ontology analysis (Figure 
[3]). Several biologically plausible GO categories for a comparison of methylation differences in brain tis- 
sues fell into the "marginally significant" bin of observed p-values between 0.05 and 0.1 but had consistent 
replication. 

Similarly, there were many smaller gene sets that had statistically significant p-values (p < 0.05) but 
never appeared in any of the resampled datasets (R=0) in both the gene expression (N = 32) and DNA 
methylation datasets (N = 12). These represent very unstable gene sets, and should be interpreted with 
caution. Categories like the first set (p > 0.05, R > 0.8) would have been ignored in a traditional gene 
set analysis given their statistical significance measure, but might be biologically important to the question 
of interest. Likewise, gene sets in the second category (p < 0.05, R = 0) are probably less biologically 
meaningful even though they are "statistically significant". 



Results 

R correlates better with replication in independent datasets 

Besides identifying which gene sets are the most stable, we can also assess how well the replication probability 
(R) reflects biological replication by spiking-in gene set enrichment across two independent simulated datasets 
(described fully in the Datasets section). We performed traditional gene ontology analysis on both datasets, 
obtaining p-values for each gene set and study calculated from the hypergeometric distribution, and then 
performed our gene set bagging algorithm on these same two datasets. There was very strong Spearman 
correlation between datasets across 100 simulation runs when all gene sets were considered regardless of 
whether the replication proportion (median=0.854, IQR: 0.826-0.876) or p-value (median = 0.836, IQR: 
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0.809-0.869) was used (Figurejlp). However, when only gene sets where at least 1 of 2 datasets was significant 
at p < 0.05 per simulation run, the replication proportion had much stronger correlation (median = 0.755, 
IQR =0.678-0.817) than the p-value (median = 0.535, IQR: 0.387 - 0.648) (Figure[lp). These results suggest 
that globally, there might not be a large difference between the replication proportion and the p-value, but 
when there is any signal in a particular gene set, the replication proportion better captures independent 
replication of that set in future studies. We also performed the more robust Wilcoxon rank rest on these 
simulated paired datasets, which also had less correlation between the resulting p- values than the replication 
probability (Figure []£). There were many fewer significant gene sets by this enrichment approach than the 
hypergeometric test, and it was rare that both independent datasets within a simulation were significant 
at p < 0.05. The p-values derived from the Wilcoxon method therefore appear more conservative than 
collapsing each gene set into a 2x2 table and performing the hypergeometric test. 



Relationship to the problem of regions 

The set of test statistics corresponding to genes within an individual set can be viewed as a multivariate 
random variable. When viewed in this way, a gene set is significant if the vector of test statistics falls into 
a multi-dimensional region defined by the significance threshold. The replication probability is then a first 
approximation estimate of the posterior probability a gene set will be significant, assuming a non- informative 
prior distribution on the vector of test statistics. This problem has been considered in the case of multivariate 



normal data 20 and for estimating confidence in inferred phylogenies 21 . As has been previously pointed 
out, this posterior probability is a reasonable first approximation to the posterior probability in question, 



but should not be interpreted as a frequentist measure of statistical significance 20 22 



As an example of the relationship between the bootstrap and a posterior probability suppose Z\ , . . . , z n ~ 
iV(/z, a 2 ). A non-informative prior distribution for the parameters (^,<t 2 ) is the Jeffrey's prior |23 . The 
Jeffrey's prior for [i is an improper uniform prior across the real line and the Jeffrey's prior for a 2 oc ■ Using 
these prior distributions, the posterior distribution for /i is N(z, t 2 ) where r ~ Inver seW ishart n ^i((ns 2 )~ r ) 
and s 2 = ^X^r=i( z * ~ ^) 2 - I n this case, since /i is one dimensional, the InverseWishart distribution is 
equivalent to an InveseGamma distribution. Drawing bootstrap samples from the Zi and recalculating 
the mean approximates sampling from the posterior distribution of fi (see supplemental R code), ft is 
important to note that the variance of the posterior for [i is inflated compared to a 2 assuming a frequentist 



model 20 22 . Note that the p-values these bootstrap samples should not be interpreted as measures of 
statistical significance, because they are no longer distributed uniformly. 



9 



Discussion 

We have developed a resampling-based strategy for estimating the probability a gene set will replicate as 
statistically significant. This direct approach to estimating replicability may be more useful than statistical 
significance for investigators who aim to identify stable and reproducible biological stories for their results. By 
utilizing outcome-based resamplings of the observed data, the reproducibility of gene sets can be quantified, 
represented by the replication probability R of each gene set category across all subsamples. This approach 
can offer an additional metric beyond the p-value for identifying which biological pathways to follow-up. 
We have successfully applied this method to gene expression and DNA methylation under two commonly- 
used enrichment metrics: the hypergeometric test and the wilcoxon rank test, and demonstrate that many 
seemingly statistically significant GO categories fail to replicate consistently. A strength of our approach is 
the generalizability of this algorithm to most other genomics applications, including following bias-correcting 
approaches like SVA during the analysis, to assess the stability of results lists. 

Overall, between the two most commonly used gene set enrichment measures, the Wilcoxon rank test 
appears more stable than the hypergeometric p-value, using simulated and real data. There were fewer 
inconsistent gene sets (significant either by R or the p-value, but not the other), and the relationship 
between the replication probability and p-value was more precisely defined (Figure |Tj3 and[2|3). 

Gene sets with high replication probabilities and low p-values represent statistically significant, stable, 
and consistent sets that might best represent the underlying biology within the experiment. Given that 
most genomics studies require some form of external replication and that R appears more correlated with 
replication in second follow-up study than p-values alone, we might also suggest following-up gene sets 
that have high replication probabilities (R) even if the p-values are marginally, or even non-, significant. 
From a practical perspective, the gene set bagging algorithm has been turn into the R package "GeneSet- 
Bagging", available through GitHub (https://github.com/andrewejaffe/GeneSetBagging). While defining a 
recommended cutoff seems counterproductive, as in different applications, users may choose different cutoffs 
depending on their resources for replication and how willing they are to be correct, a cutoff value of R 
above 0.5 means that gene set is more likely to replicate than not, and could be used as a lower bound for 
replicability. 

Typical genomics practice often involves drawing the majority of biological conclusions of an experiment 
from the results of a gene set analysis without assessing the stability of the results. We envision replication 
probabilities used in conjunction with standard measures of statistical significance, as the emphasis on 
replication in genetics/genomics makes the replication probability a useful quantity to estimate and use in 
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conjunction with p-values. We have demonstrated that gene lists are not necessarily stable, and therefore 
additional steps like gene set bagging should be undertaken to maximize the biological inference of a given 
study. 
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Figure 1: Replicability assessed by simulation. Differential expression of specific genes was inserted into 
1000 datasets, and both the observed p-values for the (A) hypergeometric and (B) Wilcoxon Rank tests 
and subsequent replication probabilities were calculated. The x-axis is the proportion of observed p-values 
that are less than 0.05 for each gene set and the y-axis is the average replication probability for that gene 
set. Correlations were calculated using the Spearman method to avoid issues with non-linearity. Then, one 
hundred sets (2288 genes) were generated each with two independent simulated expression datasets, with 
differential expression signal inserted with some variability at 500 genes. Gene set tests were performed by 
the hypergeometric test, followed by gene set bagging, and distributions of Spearman correlations between 
independent datasets across 100 simulation runs for (C) all gene sets and (D) those significant in either 
independent dataset at p < 0.05. The replication proportion offers better correlation between independent 
datasets for significant gene sets, but similar correlation across all significant and non-significant gene sets, 
than the p- value for the hypergeometric test. Lastly, (E) overall correlation is improved for the replication 
proportion versus the p-value for the Wilcoxon rank test when all gene sets are considered. 
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Figure 2: Expression dataset gene set analysis. Gene set analyses were performed by the (A) hypergeometric 
and (B) Wilcoxon rank tests using gene sets defined by the Gene Ontology, and the replication of each gene 
set was assessed via our gene set bagging procedure (each point is one gene set). The relationship between 
the estimated replication probability (R) and traditionally reported p- value appears much more stable using 
the Wilxocon rank test. 
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Figure 3: DNA methylation dataset gene set analysis. Gene set analyses and gene set bagging were performed 
by the (A) hypergeometric and (B) Wilcoxon rank tests using gene sets defined by the Gene Ontology. The 
relationship between the estimated replication probability (i?) and traditionally reported p-value appears 
are now only slightly more stable by the Wilxocon rank test, which can be attributed to this dataset having 
approximately three times more samples under study. 
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